

# Summary of video

- Data-dependence
- How to handle data dependencies?
  - Detect and Wait
  - Data forwarding through register
  - Detect and forward
- In order and out of order execution
- Instruction reordering and renaming
- Loop unrolling

# CPI of a pipeline without stalls



CPI= No of clock cycles/instruction

Steady state CPI = (No of instructions + no of stalls) / No of instructions

## How to handle data dependencies

- Anti and output dependences are easier to handle
- True (Flow or RAW)dependences are more difficult to handle as they constitute true dependence on a value
  - Detect and wait until value is available in register file
    - Stall the program. (HARDWARE)
    - Compiler can also plug in the NOP instructions in between. (SOFTWARE)
  - Detect and forward / bypass data to dependent instruction

## Detect and wait

Time (clock cycles)



12

#### Hardware stall

| Instr. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|--------|---|---|---|---|---|---|---|---|---|
| I1     |   |   |   |   |   |   |   |   |   |
| 12     |   |   |   |   |   |   |   |   |   |

| Instr. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|--------|---|---|---|---|---|---|---|---|---|
| I1     |   |   |   |   |   |   |   |   |   |
|        |   |   |   |   |   |   |   |   |   |
|        |   |   |   |   |   |   |   |   |   |
|        |   |   |   |   |   |   |   |   |   |

# Data Forwarding - through register



Solution: write and read in the same cycle Most processors have this as it is easy to implement

# Detect and Forward / bypass

- Data forwarding.
- From pipeline stage registers to function units
- Forward data only in the clock cycle when they are required



## Detect and Forward / bypass

From pipeline stage registers to function units



# Data forwarding – example 2

# Without forwarding (writeback and decode can happen simultaneously)

11: ADD X1, X2, X3

12: LDUR X2, [X1, #0]

13: AND X6, X7, X1

|  | Clocks | 1  | 2  | 3  | 4 | 5  | 6  | 7  | 8  | 9  | 10 |  |  |
|--|--------|----|----|----|---|----|----|----|----|----|----|--|--|
|  | I1     | IF | ID | EX | M | WB |    |    |    |    |    |  |  |
|  | 12     |    | IF | S  | S | ID | EX | M  | WB |    |    |  |  |
|  | 13     |    |    |    |   | IF | ID | EX | M  | WB |    |  |  |

### With forwarding

| Clock cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|-------------|---|---|---|---|---|---|---|
| <b>I</b> 1  |   |   |   |   |   |   |   |
| 12          |   |   |   |   |   |   |   |
| 13          |   |   |   |   |   |   |   |

Steady state CPI = (No of instructions + no of stalls) / No of instructions Steady state CPI (no forwarding) = Steady state CPI (forwarding) =

# Why steady state CPI?

# Branch is not handled here

#### Loop:

LDUR X0, [X1, #0] ADD X4, X0, X2 SUBI X1, X1, #8

CBNZ X1, Loop

| Clock | 1 | 2 | 3 | 4 | 5  | 6 | 7 | 8  | 9  | 10 | 11 | 12 |
|-------|---|---|---|---|----|---|---|----|----|----|----|----|
| I1    | F | D | Ε | М | WB |   |   |    |    |    |    |    |
| 12    |   | F | S | S | D  | Е | M | WB |    |    |    |    |
| 13    |   |   |   |   | F  | D | Е | М  | WB |    |    |    |
| 14    |   |   |   |   |    | F | S | S  | D  | Е  | М  | WB |

### CPI= Number of clocks /instruction= 12/4

| Clocks | 1 | 2 | 3 | 4 | 5  | 6 | 7 | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |   |
|--------|---|---|---|---|----|---|---|----|----|----|----|----|----|----|----|----|----|----|----|----|---|
| I1     | F | D | Е | М | WB |   |   |    |    |    |    |    |    |    |    |    |    |    |    |    | İ |
| 12     |   | F | S | S | D  | Ε | M | WB |    |    |    |    |    |    |    |    |    |    |    |    |   |
| 13     |   |   |   |   | F  | D | Е | M  | WB |    |    |    |    |    |    |    |    |    |    |    |   |
| 14     |   |   |   |   |    | F | S | S  | D  | M  | WB |    |    |    |    |    |    |    |    |    |   |
| 15     |   |   |   |   |    |   |   |    | F  | D  | Ε  | М  | WB |    |    |    |    |    |    |    |   |
| 16     |   |   |   |   |    |   |   |    |    | F  | S  | S  | D  | Е  | M  | WB |    |    |    |    |   |
| 17     |   |   |   |   |    |   |   |    |    |    |    |    | F  | D  | Е  | М  | WB |    |    |    |   |
| 18     |   |   |   |   |    |   |   |    |    |    |    |    |    | F  | S  | S  | D  | Е  | M  | WB |   |

CPI= Number of clocks /instruction= 20/8, Is 12/4 = 20/8?, Hence steady state CPI

## Performance improvement using Dynamic Scheduling

Assume full data forwarding

### Out-of-order processors:

#### After instruction decode

don't wait for previous instructions to execute if this instruction does not depend on them,
 i.e., independent ready instructions can execute before earlier instructions that are stalled

### in-order processors

LDUR **X1**, [X4, #100] ADD X2, **X1**, X4 SUB X5, X6, X7

## out-of-order processors

LDUR **X1**, [X4, #100] SUB X5, X6, X7 ADD X2, **X1**, X4(no stalls)



# Register Renaming

```
ADD
                   X4, X2, X1;
                                  X4 ← X2 + X1
WAR
        ANDI
                   X1, X0, #2; X1 ← X0 & 2
                 Rename X1 to X3
                   X4, X2, X1; X4 \leftarrow X2 + X1
        ADD
        AND
                   X3, X0, #2; X3 ← X0 & 2
WAW
        ADD
                                 X0 ← X2 + X1
                   X0, X2, X1;
        SUB
                   X0, X3, X5; X0 \leftarrow X3 - X5
                 Rename X0 to X4
                   X0, X2, X1; X0 \leftarrow X2 + X1
        ADD
        SUB
                   X4, X3, X5; X4 \leftarrow X3 - X5
```

More register resources will be needed

# Loop unrolling

- loop unrolling leads to multiple replications of the loop body
  - unrolling creates longer code sequences
  - goal is to execute iterations in parallel
- Example

```
for (i=0: i < 16; i++) {
c[i] = a[i]+ b[i];
}</pre>
```



```
for (i=0: i < 8; i++) {
c[2 * i] = a[2 * i] + b[2 * i];
c[2 * i+1] = a[2 * i+1] + b[2 * i+1];
}</pre>
```

- greater demand for registers
  - higher register pressure: more concurrency demands for more resources

How can loop unrolling help us to reduce the stalls and to improve CPI?

# Loop unrolling example

```
X0, [X1, #0];
   Loop:
             LDUR
                                           load to X0 from mem[0+X1]
             ADD
                       X4, X0, X2;
                                           add [X0]+[X2]
             STUR
                      X4, [X1, #0];
                                           store X4 to mem[0+X1]
                                                                      If data forwarding is not allowed,
                      X1, X1, #8;
             SUBI
                                           decrement pointer 8
                                                                       write back and decode of two
                                                                          instructions can happen
             CBNZ
                       X1,Loop;
                                           branch X1!=zero
                                                                              simultaneously
1. Loop: LDUR
                    X0, [X1, #0]
                    X4, X0, X2
2.
         ADD
3.
         STUR
                   X4, [X1, #0]
4.
         SUBI
                    X1, X1, #8;
                                                                       E
         CBNZ
5.
                   X1,Loop;
```

CPI = (No of instructions + no of stall) / No. of instruction =

# Loop unrolling example

```
1 Loop: LDUR
               X0, [X1, #0]
                            2 stall
               X4, X0, X2
        ADD
             X4, [X1, #0] 2 stall
        STUR
                                         ;drop SUBI & CBNZ
        LDUR
               X6, [X1, #-8]
               X8, X6, X2 2 stall
5
        ADD
               X8, [X1, #-8] 2 stall
        STUR
                                         ;drop SUBI & BNEZ
6
        LDUR
               X10, [X1,#-16]
               X12, X10, X2 2 stall
8
        ADD
               X12, [X1, #-16]2 stall
                                         ;drop SUBI & BNEZ
        STUR
               X14, [X1, #-24]
10
        LDUR
               X16, X14, X2 2 stall
        ADD
               X16, [X1, #-24] 2 stall
        STUR
12
13
        SUBI
                X1, X1, #32
                                         ;alter to 4*8
                            2 stall
14
        CBNZ
                X1,LOOP
```

Rewrite loop to minimize stalls?

STRAIGHT FORWARED UNROLLING, CPI=

# Branch is not handled here

# Loop unrolling with reordering

```
1 Loop: LDUR
                X0, [X1, #0]
        ADD
                X4, X0, X2
                             2 stall
                X4, [X1, #0] 2 stall
        STUR
        LDUR
                X6, [X1, #-8]
5
        ADD
                X8, X6, X2 2 stall
                X8, [X1, #-8] 2 stall
        STUR
6
        LDUR
                X10, [X1,#-16]
8
        ADD
                X12, X10, X2 2 stall
9
        STUR
                X12, [X1, #-16] 2 stall
10
        LDUR
                X14, [X1,#-24]
11
        ADD
                X16, X14, X2 2 stall
                X16, [X1, #-24] 2 stall
12
        STUR
13
        SUBI
                X1, X1, #32
14
        CBNZ
                X1,LOOP
                             2 stall
```

```
1 Loop: LDUR
                X0, [X1, #0]
                X6, [X1, #-8]
        LDUR
        LDUR
                X10, [X1,#-16]
                X14, [X1,#-24]
        LDUR
5
        ADD
                X4, X0, X2
6
        ADD
                X8, X6, X2
                X12, X10, X2
        ADD
8
        ADD
                X16, X14, X2
9
                X4, [X1, 0]
        STUR
        STUR
                X8, [X1, #-8]
10
                X12, [X1, #-16]
11
        STUR
                                    2 stall here
                X16, [X1, #-24]
        STUR
12
        SUBI
                X1, X1, #32 ←
13
        CBNZ
                X1,LOOP
14
```

No dataforwarding, WB and DEC happens simultaneously

# How data hazards can be eliminated?

### Summary

- Detect and wait (stalling unnecessarily)
- Data forwarding
- Reordering (out of order)
- Renaming (to remove WAR and WAW hazard)
- Loop unrolling and reordering

# Lab 2 (Quiz)

- 15 min open book quiz
- Fill in the blanks, T/F and MCQ
- Max 5 questions
- Rest details in the announcement.